
    elPrep: high-performance preparation of sequence alignment/map files for variant calling

    elPrep is a high-performance tool for preparing sequence alignment/map files for variant calling in sequencing pipelines. It can be used as a replacement for SAMtools and Picard for preparation steps such as filtering, sorting, marking duplicates, reordering contigs, and so on, while producing identical results. What sets elPrep apart is its software architecture, which allows executing preparation pipelines by making only a single pass through the data, no matter how many preparation steps are used in the pipeline. elPrep is designed as a multithreaded application that runs entirely in memory, avoids repeated file I/O, and merges the computation of several preparation steps to significantly speed up the execution time. For example, for a preparation pipeline of five steps on a whole-exome BAM file (NA12878), we reduce the execution time from about 1 hour 40 minutes, when using a combination of SAMtools and Picard, to about 15 minutes when using elPrep, while utilising the same server resources, here 48 threads and 23 GB of RAM. For the same pipeline on whole-genome data (NA12878), elPrep reduces the runtime from 24 hours to less than 5 hours. As a typical clinical study may contain sequencing data for hundreds of patients, elPrep can remove several hundred hours of computing time, and thus substantially reduce analysis time and cost.
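    The single-pass idea can be pictured as fusing all per-read preparation steps into one traversal of the alignment data, rather than re-reading and re-writing the file once per step. Below is a minimal, illustrative Java sketch of that fusion, assuming hypothetical Read and SinglePassPipeline types; it is not elPrep's actual implementation (which is multithreaded and fully in-memory), and it only covers per-read steps, since sorting and duplicate marking need additional state.

        import java.util.ArrayList;
        import java.util.List;
        import java.util.function.UnaryOperator;

        // Hypothetical minimal representation of an aligned read.
        final class Read {
            String name;
            int mappingQuality;
            Read(String name, int mappingQuality) {
                this.name = name;
                this.mappingQuality = mappingQuality;
            }
        }

        final class SinglePassPipeline {
            private final List<UnaryOperator<Read>> steps = new ArrayList<>();

            // Register a preparation step; nothing is executed yet.
            SinglePassPipeline add(UnaryOperator<Read> step) {
                steps.add(step);
                return this;
            }

            // Apply all registered steps to each read in a single traversal of the data.
            List<Read> run(List<Read> reads) {
                List<Read> out = new ArrayList<>(reads.size());
                for (Read read : reads) {
                    Read current = read;
                    for (UnaryOperator<Read> step : steps) {
                        current = step.apply(current);
                        if (current == null) break;   // a filtering step dropped the read
                    }
                    if (current != null) out.add(current);
                }
                return out;
            }

            public static void main(String[] args) {
                List<Read> reads = List.of(new Read("r1", 60), new Read("r2", 10));
                List<Read> prepared = new SinglePassPipeline()
                        .add(r -> r.mappingQuality >= 30 ? r : null)            // quality filter
                        .add(r -> { r.name = r.name.toLowerCase(); return r; }) // stand-in for another per-read step
                        .run(reads);
                System.out.println(prepared.size() + " read(s) kept");          // prints: 1 read(s) kept
            }
        }

    However many steps are registered, the data is scanned only once, which is the property the abstract highlights.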

    A highly efficient multi-core algorithm for clustering extremely large datasets

    Background: In recent years, the demand for computational power in computational biology has increased due to rapidly growing data sets from microarray and other high-throughput technologies, and this demand is likely to increase further. Standard algorithms for analyzing data, such as cluster algorithms, need to be parallelized for fast processing. Unfortunately, most approaches for parallelizing algorithms largely rely on network communication protocols connecting, and requiring, multiple computers. One answer to this problem is to utilize the intrinsic capabilities of current multi-core hardware to distribute the tasks among the different cores of one computer.
    Results: We introduce a multi-core parallelization of the k-means and k-modes cluster algorithms, based on the design principles of transactional memory, for clustering gene expression microarray-type data and categorical SNP data. Our new shared-memory parallel algorithms prove to be highly efficient. We demonstrate their computational power and show their utility in cluster stability and sensitivity analysis employing repeated runs with slightly changed parameters. For large data sets, the computation speed of our Java-based algorithm was increased by a factor of 10 compared to single-core implementations and a recently published network-based parallelization, while preserving computational accuracy.
    Conclusions: Most desktop computers and even notebooks provide at least dual-core processors. Our multi-core algorithms show that, using modern algorithmic concepts, parallelization makes it possible to perform even such laborious tasks as cluster sensitivity analysis and cluster number estimation on the laboratory computer.
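    The key point of the shared-memory approach is that the expensive assignment step of k-means can be split across the cores of a single machine without any network communication. The following is a minimal Java sketch of such a parallel assignment step, using plain parallel streams rather than the transactional-memory design described by the authors; all names are illustrative, and this is not their implementation.

        import java.util.stream.IntStream;

        final class ParallelKMeansStep {

            static double squaredDistance(double[] a, double[] b) {
                double sum = 0.0;
                for (int i = 0; i < a.length; i++) {
                    double d = a[i] - b[i];
                    sum += d * d;
                }
                return sum;
            }

            // Assign every data point to its nearest centroid, processing the
            // points in parallel across the available cores. Each thread writes
            // to disjoint indices of the labels array, so no locking is needed.
            static int[] assign(double[][] points, double[][] centroids) {
                int[] labels = new int[points.length];
                IntStream.range(0, points.length).parallel().forEach(p -> {
                    int best = 0;
                    double bestDistance = Double.MAX_VALUE;
                    for (int c = 0; c < centroids.length; c++) {
                        double distance = squaredDistance(points[p], centroids[c]);
                        if (distance < bestDistance) {
                            bestDistance = distance;
                            best = c;
                        }
                    }
                    labels[p] = best;
                });
                return labels;
            }
        }

    The centroid update step can be parallelized in the same spirit, and the repeated runs with slightly changed parameters used for stability and sensitivity analysis are independent of each other, so they too can be spread across cores.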

    Scaling Genomics Data Processing with Memory-Driven Computing to Accelerate Computational Biology

    Research is increasingly becoming data-driven, and the natural sciences are no exception. In both biology and medicine, we are observing an exponential growth of structured data collections from experiments and population studies, enabling us to gain novel insights that would otherwise not be possible. However, these growing data sets pose a challenge for existing compute infrastructures, since the data are outgrowing the limits of the available compute resources. In this work, we present the application of a novel approach, Memory-Driven Computing (MDC), in the life sciences. MDC proposes a data-centric architecture designed for growing data sizes, providing a composable infrastructure for changing workloads. In particular, we show how a typical pipeline for genomics data processing can be accelerated and which application modifications are required to exploit this novel architecture. Furthermore, we demonstrate how the isolated evaluation of individual tasks misses significant overheads of typical genomics data processing pipelines.
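    One way to picture the data-centric approach is a pipeline whose intermediate results stay in memory and flow directly into the next stage, instead of being serialized to files and re-read between steps. The Java sketch below illustrates only that chaining pattern; the stage names are hypothetical, and no actual Memory-Driven Computing hardware or API is involved.

        import java.util.function.Function;

        // Intermediate results are handed to the next stage in memory;
        // no serialization, temporary files, or re-parsing between stages.
        final class InMemoryPipeline<T> {
            private final T data;

            private InMemoryPipeline(T data) { this.data = data; }

            static <T> InMemoryPipeline<T> of(T data) {
                return new InMemoryPipeline<>(data);
            }

            <R> InMemoryPipeline<R> then(Function<T, R> stage) {
                return new InMemoryPipeline<>(stage.apply(data));
            }

            T result() { return data; }

            public static void main(String[] args) {
                // Hypothetical stages chained without intermediate files.
                String result = InMemoryPipeline.of("raw reads")
                        .then(in -> in + " -> aligned")
                        .then(in -> in + " -> deduplicated")
                        .then(in -> in + " -> variants")
                        .result();
                System.out.println(result);
            }
        }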